1 Executive Summary

In short, we took a raw data set, cleaned it, and analysed it to report observations and predictions about Airbnb’s presence in Brussels. We started by converting variables to usable formats. We then selected variables from the raw data set based on their usefulness for regressions, data visualisations, correlation tests and similar analyses. Using these variables, we plotted bar charts, distribution charts, and more. Lastly, we ran regressions and made some predictions.

1.1 Data Visualisation

We approached the data visualisation with the objective of challenging preconceived notions we had about the relationship between customer and seller in the Airbnb context. Through some qualitative analysis, we answered questions such as “What is the relationship between the quality of a host (seller) and the price they charge?” and compared the results to our hypotheses.

1.2 Regression Analysis

1.2.1 Creating the explained variable

For the regression part, we started by creating a new variable called Price_4_nights, the cost of staying 4 nights at an Airbnb. Because we were looking specifically at the cost for 2 people, we filtered the data to keep only Airbnbs that can accommodate at least 2 people. For the regression models, however, we use log_Price_4_nights as the explained variable, since its distribution is close to normal. Before fitting the regression models, we split the full data set into a training set and a test set.
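The steps above can be sketched roughly as follows. This is only an illustration: the toy data frame is hypothetical, the formula Price_4_nights = 4 * price is an assumption (the exact definition is not shown in this summary), and the 75/25 split ratio is also assumed.

```r
# Toy stand-in for the cleaned listings (hypothetical values)
toy <- data.frame(
  price        = c(45, 60, 80, 120, 95, 70, 150, 55),
  accommodates = c(1, 2, 2, 4, 3, 2, 6, 2)
)

# Keep listings that can host at least 2 people
toy <- toy[toy$accommodates >= 2, ]

# Assumption: cost of 4 nights = 4 * nightly price
toy$Price_4_nights     <- 4 * toy$price
toy$log_Price_4_nights <- log(toy$Price_4_nights)

# Random train/test split (the 75% / 25% ratio is an assumption)
set.seed(1234)
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.75 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]
```

Note that the raw price column has a minimum of 0; such listings would give log(0) = -Inf and would need to be removed before taking logs.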

1.2.1.1 Model Results

Model 1
For Model 1, we tested the significance of property type, number of reviews and review score rating for the price of an Airbnb. At first glance, there is a negative relationship between review score rating and the price for 4 nights, which seems strange, since we would normally expect higher-rated properties to command higher prices. However, the coefficient is close to zero and not statistically significant. The other variables are significant. prop_type_simplified is a categorical variable, so the first thing to note is that the regression uses entire condominium (condo) as the baseline. The intercept can be interpreted as the log Price_4_nights of an entire condo, 5.883, when the other predictors are zero. For a private room in a rental unit or a private room in a residential home, the log price is lower by 0.563 and 0.430 respectively. This makes sense, as renting a room costs less than renting an entire condo. Overall, property type is a significant predictor of the price of an Airbnb. Collinearity is not an issue in this model, since every VIF is below 5. Running Model 1 on the test set gives RMSE = 0.518.
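Because the explained variable is on the log scale, a coefficient b corresponds to a multiplicative change of exp(b) in price. A small worked example using the two coefficients quoted above (the vector names are made up for illustration):

```r
# Coefficients reported for Model 1, relative to the entire-condo baseline
coefs <- c(private_room_rental_unit      = -0.563,
           private_room_residential_home = -0.430)

# Convert a shift in log price into a percentage change in price
pct_change <- (exp(coefs) - 1) * 100
round(pct_change, 1)
```

Under this reading, the two private-room types are roughly 43% and 35% cheaper than an entire condo, holding the other predictors fixed.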
Model 2
review_scores_rating is insignificant, so we drop it from the regressions that follow. In Model 2, we want to determine whether room type is a significant predictor of the cost for 4 nights. We find that every room type except hotel room is a significant predictor of price. Again, collinearity is not an issue in this model, since every VIF is below 5. Checking for overfitting, we find RMSE = 0.504 on the test set.
Model 3
For Model 3, we want to determine whether the number of bathrooms, bedrooms and beds and the size of the house are significant predictors of the cost for 4 nights. The number of beds is not a significant predictor of log_price_4_nights. However, the number of bedrooms, the number of bathrooms and the size of the house are significant predictors. Since every VIF is below 5, there does not seem to be any issue of multicollinearity. Checking for overfitting, we find RMSE = 0.441 on the test set.
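The report does not show how the VIFs were computed. As a minimal sketch of the idea, the VIF of a numeric predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on the other predictors. The data below are simulated and purely hypothetical; only the variable names echo the report.

```r
set.seed(42)
n <- 200

# Simulated predictors (hypothetical data, not the Brussels listings)
bedrooms  <- rpois(n, 1) + 1
beds      <- bedrooms + rbinom(n, 1, 0.3)   # deliberately correlated with bedrooms
bathrooms <- 1 + rbinom(n, 2, 0.2)
X <- data.frame(bedrooms, beds, bathrooms)

# VIF for each predictor: regress it on the remaining predictors
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
vif_manual
```

A value near 1 means the predictor is almost unrelated to the others; the threshold of 5 used in this report is a common rule of thumb.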
Model 4
For Model 4, we want to understand whether superhosts (host_is_superhost) command a pricing premium, after controlling for other variables. At first glance, being a superhost seems to command a pricing premium. However, the coefficient is not statistically significant, so at the 5% level we cannot conclude that superhosts command a pricing premium. Since every VIF is below 5, there does not seem to be any issue of multicollinearity. We find RMSE = 0.418 on the test set.
Model 5
host_is_superhost is not significant, so we do not include it in this regression. For Model 5, we want to see whether hosts who allow guests to book their listing instantly command a price premium compared to those who do not. We find that being able to book an Airbnb instantly is a significant predictor of price. Since every VIF is below 5, there does not seem to be any issue of multicollinearity. Checking RMSE on the test set, we find RMSE = 0.441.
Model 6
For Model 6, we created a new variable called neightbourhood_simplified, grouping the 19 neighbourhoods of Brussels into 5 areas based on where they are located in the city: North West, North East, East/Centre, West/Centre and South/Centre. Location is a significant predictor of log_price_4_nights, as seen from the t-statistics. Rooms located in the East do not have a significant effect on price, whereas rooms located in the North East, North West and South have a significant positive effect on price_4_nights. Again, since every VIF is below 5, there does not seem to be any issue of multicollinearity. Checking RMSE on the test set, we find RMSE = 0.441.
Model 7
For Model 7, we look at the effect of availability_30 and reviews_per_month on log_price_4_nights, after controlling for other variables. In this model number_of_reviews is not significant; when we replace it with review_scores_rating, the latter is significant. This may be because reviews_per_month already captures much of the information in number_of_reviews, which makes that variable insignificant. We also find that availability_30 and reviews_per_month have a significant positive effect on price_4_nights. Again, since every VIF is below 5, there does not seem to be any issue of multicollinearity. Checking RMSE on the test set, we find RMSE = 0.3703.

1.2.1.1.1 Prediction

Choosing a model
Model 7 has the highest adjusted R^2 and the lowest RMSE on the test set, which means it has the best explanatory power and shows no sign of overfitting. We therefore use Model 7 for prediction.
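For reference, a minimal sketch of the RMSE criterion and of the comparison across models. The rmse helper is a hypothetical name; the RMSE values are the test-set figures reported above.

```r
# RMSE between observed and predicted values (both on the log scale in this report)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Test-set RMSEs reported for Models 1 to 7
test_rmse <- c(m1 = 0.518, m2 = 0.504, m3 = 0.441, m4 = 0.418,
               m5 = 0.441, m6 = 0.441, m7 = 0.3703)

# Model with the smallest test-set RMSE
names(which.min(test_rmse))
```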
Prediction
Suppose we want to book a private room in a rental unit with 1 bathroom and 1 bedroom that can accommodate 2 people. The room should be instantly bookable and located in the North West, with 1.6 reviews per month, an average rating of 4.5 and about 15 days of availability in the next 30 days. Based on Model 7, the expected price for 4 nights is 155.0 euros, with a 95% interval from 71.5 euros to 336.2 euros.
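The interval is not symmetric around 155 euros, which is what one would expect if it was computed on the log scale and then exponentiated. That is an assumption about how the numbers were produced, but the reported figures are consistent with it, as this small check suggests:

```r
# Reported prediction and bounds, in euros
point <- 155.0
lower <- 71.5
upper <- 336.2

# On the log scale the two half-widths should be roughly equal
log_half_widths <- c(lower = log(point) - log(lower),
                     upper = log(upper) - log(point))
round(log_half_widths, 3)

# Back-transforming a symmetric log-scale interval recovers the euro bounds
fit_log <- log(point)
back_transformed <- exp(fit_log + c(-1, 1) * mean(log_half_widths))
back_transformed
```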

2 Exploratory Data Analysis (EDA)

2.1 Data wrangling

glimpse(listings)
Rows: 5,442
Columns: 74
$ id                                           <dbl> 2352, 2354, 45145, 48180,~
$ listing_url                                  <chr> "https://www.airbnb.com/r~
$ scrape_id                                    <dbl> 2.021092e+13, 2.021092e+1~
$ last_scraped                                 <date> 2021-09-25, 2021-09-25, ~
$ name                                         <chr> "Triplex-2chmbrs,grande s~
$ description                                  <chr> "Cute 2 bedrooms appartme~
$ neighborhood_overview                        <chr> "Basilique Koekelberg, Ch~
$ picture_url                                  <chr> "https://a0.muscache.com/~
$ host_id                                      <dbl> 2582, 2582, 199370, 21956~
$ host_url                                     <chr> "https://www.airbnb.com/u~
$ host_name                                    <chr> "Oda", "Oda", "Erick", "A~
$ host_since                                   <date> 2008-08-28, 2008-08-28, ~
$ host_location                                <chr> "Belgium", "Belgium", "Br~
$ host_about                                   <chr> "Hi there! I've been a ho~
$ host_response_time                           <chr> "within an hour", "within~
$ host_response_rate                           <chr> "100%", "100%", "N/A", "N~
$ host_acceptance_rate                         <chr> "100%", "100%", "N/A", "N~
$ host_is_superhost                            <lgl> FALSE, FALSE, FALSE, FALS~
$ host_thumbnail_url                           <chr> "https://a0.muscache.com/~
$ host_picture_url                             <chr> "https://a0.muscache.com/~
$ host_neighbourhood                           <chr> "Molenbeek-Saint-Jean", "~
$ host_listings_count                          <dbl> 3, 3, 2, 1, 1, 13, 13, 13~
$ host_total_listings_count                    <dbl> 3, 3, 2, 1, 1, 13, 13, 13~
$ host_verifications                           <chr> "['email', 'phone', 'revi~
$ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ host_identity_verified                       <lgl> FALSE, FALSE, TRUE, FALSE~
$ neighbourhood                                <chr> "Sint-Jans-Molenbeek, Bru~
$ neighbourhood_cleansed                       <chr> "Molenbeek-Saint-Jean", "~
$ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA, NA, NA, N~
$ latitude                                     <dbl> 50.85702, 50.85709, 50.85~
$ longitude                                    <dbl> 4.30771, 4.30757, 4.36809~
$ property_type                                <chr> "Entire rental unit", "En~
$ room_type                                    <chr> "Entire home/apt", "Entir~
$ accommodates                                 <dbl> 5, 4, 2, 2, 3, 3, 3, 3, 6~
$ bathrooms                                    <lgl> NA, NA, NA, NA, NA, NA, N~
$ bathrooms_text                               <chr> "1 bath", "1 bath", "1 ba~
$ bedrooms                                     <dbl> 2, 1, 1, 2, 1, NA, NA, NA~
$ beds                                         <dbl> 2, 1, 1, 2, 1, 2, 2, 2, 4~
$ amenities                                    <chr> "[\"Baby bath\", \"Luggag~
$ price                                        <chr> "$90.00", "$74.00", "$95.~
$ minimum_nights                               <dbl> 2, 2, 1, 2, 5, 1, 1, 1, 1~
$ maximum_nights                               <dbl> 365, 365, 1125, 14, 120, ~
$ minimum_minimum_nights                       <dbl> 2, 2, 2, 2, 5, 1, 1, 1, 1~
$ maximum_minimum_nights                       <dbl> 2, 2, 2, 2, 5, 1, 1, 1, 1~
$ minimum_maximum_nights                       <dbl> 1125, 1125, 1125, 14, 120~
$ maximum_maximum_nights                       <dbl> 1125, 1125, 1125, 14, 120~
$ minimum_nights_avg_ntm                       <dbl> 2, 2, 2, 2, 5, 1, 1, 1, 1~
$ maximum_nights_avg_ntm                       <dbl> 1125, 1125, 1125, 14, 120~
$ calendar_updated                             <lgl> NA, NA, NA, NA, NA, NA, N~
$ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ availability_30                              <dbl> 16, 23, 19, 30, 2, 28, 23~
$ availability_60                              <dbl> 46, 53, 42, 60, 6, 58, 53~
$ availability_90                              <dbl> 76, 83, 67, 90, 36, 88, 8~
$ availability_365                             <dbl> 256, 358, 337, 365, 311, ~
$ calendar_last_scraped                        <date> 2021-09-25, 2021-09-25, ~
$ number_of_reviews                            <dbl> 17, 2, 3, 0, 105, 5, 62, ~
$ number_of_reviews_ltm                        <dbl> 1, 0, 0, 0, 0, 1, 2, 0, 0~
$ number_of_reviews_l30d                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ first_review                                 <date> 2014-08-26, 2016-04-25, ~
$ last_review                                  <date> 2017-06-30, 2018-10-28, ~
$ review_scores_rating                         <dbl> 4.44, 4.00, 5.00, NA, 4.8~
$ review_scores_accuracy                       <dbl> 4.63, 5.00, 5.00, NA, 4.8~
$ review_scores_cleanliness                    <dbl> 4.69, 5.00, 5.00, NA, 4.8~
$ review_scores_checkin                        <dbl> 4.56, 5.00, 5.00, NA, 4.9~
$ review_scores_communication                  <dbl> 4.75, 5.00, 4.00, NA, 4.9~
$ review_scores_location                       <dbl> 4.00, 5.00, 5.00, NA, 4.8~
$ review_scores_value                          <dbl> 4.44, 5.00, 4.00, NA, 4.7~
$ license                                      <lgl> NA, NA, NA, NA, NA, NA, N~
$ instant_bookable                             <lgl> FALSE, FALSE, TRUE, FALSE~
$ calculated_host_listings_count               <dbl> 2, 2, 2, 1, 1, 15, 15, 15~
$ calculated_host_listings_count_entire_homes  <dbl> 2, 2, 0, 1, 1, 15, 15, 15~
$ calculated_host_listings_count_private_rooms <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0~
$ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ reviews_per_month                            <dbl> 0.20, 0.03, 0.10, NA, 0.9~
skim(listings)
Data summary
Name listings
Number of rows 5442
Number of columns 74
_______________________
Column type frequency:
character 23
Date 5
logical 9
numeric 37
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 33 37 0 5442 0
name 0 1.00 1 242 0 5333 0
description 229 0.96 1 1000 0 4945 0
neighborhood_overview 2230 0.59 1 1000 0 2629 0
picture_url 0 1.00 61 126 0 5311 0
host_url 0 1.00 38 43 0 3426 0
host_name 2 1.00 1 31 0 1902 0
host_location 16 1.00 2 70 0 405 0
host_about 2656 0.51 1 3655 0 1535 5
host_response_time 2 1.00 3 18 0 5 0
host_response_rate 2 1.00 2 4 0 55 0
host_acceptance_rate 2 1.00 2 4 0 83 0
host_thumbnail_url 2 1.00 55 106 0 3390 0
host_picture_url 2 1.00 57 109 0 3390 0
host_neighbourhood 2028 0.63 5 29 0 88 0
host_verifications 0 1.00 4 141 0 188 0
neighbourhood 2230 0.59 7 63 0 126 0
neighbourhood_cleansed 0 1.00 5 21 0 19 0
property_type 0 1.00 4 35 0 45 0
room_type 0 1.00 10 15 0 4 0
bathrooms_text 12 1.00 6 17 0 27 0
amenities 0 1.00 2 1666 0 5025 0
price 0 1.00 5 9 0 290 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2021-09-24 2021-09-25 2021-09-25 2
host_since 2 1.00 2008-08-28 2021-09-19 2015-10-19 2100
calendar_last_scraped 0 1.00 2021-09-24 2021-09-25 2021-09-25 2
first_review 914 0.83 2011-06-06 2021-09-23 2019-05-27 1778
last_review 914 0.83 2010-11-06 2021-09-24 2020-03-11 1219

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 2 1 0.20 FAL: 4366, TRU: 1074
host_has_profile_pic 2 1 0.99 TRU: 5394, FAL: 46
host_identity_verified 2 1 0.87 TRU: 4741, FAL: 699
neighbourhood_group_cleansed 5442 0 NaN :
bathrooms 5442 0 NaN :
calendar_updated 5442 0 NaN :
has_availability 0 1 0.99 TRU: 5382, FAL: 60
license 5442 0 NaN :
instant_bookable 0 1 0.37 FAL: 3446, TRU: 1996

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 3.131539e+07 15866077.88 2.352000e+03 1.824171e+07 3.533347e+07 4.511619e+07 5.242511e+07 ▃▃▃▆▇
scrape_id 0 1.00 2.021092e+13 0.00 2.021092e+13 2.021092e+13 2.021092e+13 2.021092e+13 2.021092e+13 ▁▁▇▁▁
host_id 0 1.00 1.100261e+08 123833524.03 2.582000e+03 1.733301e+07 4.637172e+07 1.759884e+08 4.236817e+08 ▇▂▁▁▁
host_listings_count 2 1.00 9.640000e+00 39.82 0.000000e+00 1.000000e+00 1.000000e+00 4.000000e+00 2.044000e+03 ▇▁▁▁▁
host_total_listings_count 2 1.00 9.640000e+00 39.82 0.000000e+00 1.000000e+00 1.000000e+00 4.000000e+00 2.044000e+03 ▇▁▁▁▁
latitude 0 1.00 5.084000e+01 0.02 5.077000e+01 5.083000e+01 5.084000e+01 5.085000e+01 5.090000e+01 ▁▃▇▂▁
longitude 0 1.00 4.360000e+00 0.03 4.260000e+00 4.340000e+00 4.360000e+00 4.380000e+00 4.480000e+00 ▁▃▇▂▁
accommodates 0 1.00 3.010000e+00 1.77 0.000000e+00 2.000000e+00 2.000000e+00 4.000000e+00 1.600000e+01 ▇▃▁▁▁
bedrooms 630 0.88 1.400000e+00 1.05 1.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 4.000000e+01 ▇▁▁▁▁
beds 83 0.98 1.710000e+00 1.26 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 1.600000e+01 ▇▁▁▁▁
minimum_nights 0 1.00 1.029000e+01 36.19 1.000000e+00 1.000000e+00 2.000000e+00 4.000000e+00 1.125000e+03 ▇▁▁▁▁
maximum_nights 0 1.00 2.339130e+03 120486.31 1.000000e+00 9.000000e+01 1.125000e+03 1.125000e+03 8.888888e+06 ▇▁▁▁▁
minimum_minimum_nights 1 1.00 9.910000e+00 35.85 1.000000e+00 1.000000e+00 2.000000e+00 4.000000e+00 1.125000e+03 ▇▁▁▁▁
maximum_minimum_nights 1 1.00 1.050000e+01 36.07 1.000000e+00 1.000000e+00 2.000000e+00 5.000000e+00 1.125000e+03 ▇▁▁▁▁
minimum_maximum_nights 1 1.00 2.458030e+03 120495.62 1.000000e+00 3.650000e+02 1.125000e+03 1.125000e+03 8.888888e+06 ▇▁▁▁▁
maximum_maximum_nights 1 1.00 2.476170e+03 120495.33 1.000000e+00 3.650000e+02 1.125000e+03 1.125000e+03 8.888888e+06 ▇▁▁▁▁
minimum_nights_avg_ntm 1 1.00 1.027000e+01 35.98 1.000000e+00 1.000000e+00 2.000000e+00 4.100000e+00 1.125000e+03 ▇▁▁▁▁
maximum_nights_avg_ntm 1 1.00 2.472100e+03 120495.39 1.000000e+00 3.650000e+02 1.125000e+03 1.125000e+03 8.888888e+06 ▇▁▁▁▁
availability_30 0 1.00 9.090000e+00 10.77 0.000000e+00 0.000000e+00 3.000000e+00 1.900000e+01 3.000000e+01 ▇▂▁▂▂
availability_60 0 1.00 2.300000e+01 22.48 0.000000e+00 0.000000e+00 2.100000e+01 4.500000e+01 6.000000e+01 ▇▂▂▂▃
availability_90 0 1.00 3.926000e+01 34.54 0.000000e+00 0.000000e+00 4.100000e+01 7.400000e+01 9.000000e+01 ▇▂▂▃▅
availability_365 0 1.00 1.665200e+02 134.04 0.000000e+00 3.500000e+01 1.480000e+02 3.060000e+02 3.650000e+02 ▇▃▃▂▆
number_of_reviews 0 1.00 3.537000e+01 69.70 0.000000e+00 2.000000e+00 8.000000e+00 3.500000e+01 7.820000e+02 ▇▁▁▁▁
number_of_reviews_ltm 0 1.00 5.140000e+00 11.57 0.000000e+00 0.000000e+00 1.000000e+00 5.000000e+00 1.670000e+02 ▇▁▁▁▁
number_of_reviews_l30d 0 1.00 7.600000e-01 1.64 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 2.000000e+01 ▇▁▁▁▁
review_scores_rating 914 0.83 4.590000e+00 0.65 0.000000e+00 4.500000e+00 4.750000e+00 4.920000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_accuracy 960 0.82 4.720000e+00 0.45 0.000000e+00 4.670000e+00 4.850000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_cleanliness 960 0.82 4.610000e+00 0.51 0.000000e+00 4.500000e+00 4.750000e+00 4.940000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_checkin 960 0.82 4.790000e+00 0.39 0.000000e+00 4.750000e+00 4.900000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_communication 960 0.82 4.770000e+00 0.43 0.000000e+00 4.740000e+00 4.900000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_location 960 0.82 4.730000e+00 0.38 0.000000e+00 4.640000e+00 4.830000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_value 960 0.82 4.600000e+00 0.46 0.000000e+00 4.500000e+00 4.700000e+00 4.860000e+00 5.000000e+00 ▁▁▁▁▇
calculated_host_listings_count 0 1.00 7.280000e+00 15.59 1.000000e+00 1.000000e+00 1.000000e+00 4.000000e+00 9.100000e+01 ▇▁▁▁▁
calculated_host_listings_count_entire_homes 0 1.00 5.650000e+00 13.72 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 7.800000e+01 ▇▁▁▁▁
calculated_host_listings_count_private_rooms 0 1.00 1.560000e+00 4.59 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 4.100000e+01 ▇▁▁▁▁
calculated_host_listings_count_shared_rooms 0 1.00 1.000000e-02 0.16 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 3.000000e+00 ▇▁▁▁▁
reviews_per_month 914 0.83 1.370000e+00 1.67 1.000000e-02 2.700000e-01 7.700000e-01 1.840000e+00 1.234000e+01 ▇▁▁▁▁

2.1.1 Step 1: Glimpse and Skim Results

2.1.1.1 How many variables/columns? How many rows/observations?

There are 74 variables and 5442 rows.

2.1.1.2 Which variables are numbers?

There are 37 numeric variables: id, scrape_id, host_id, host_listings_count, host_total_listings_count, latitude, longitude, accommodates, bedrooms, beds, minimum_nights, maximum_nights, minimum_minimum_nights, maximum_minimum_nights, minimum_maximum_nights, maximum_maximum_nights, minimum_nights_avg_ntm, maximum_nights_avg_ntm, availability_30, availability_60, availability_90, availability_365, number_of_reviews, number_of_reviews_ltm, number_of_reviews_l30d, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, calculated_host_listings_count, calculated_host_listings_count_entire_homes, calculated_host_listings_count_private_rooms, calculated_host_listings_count_shared_rooms and reviews_per_month.

2.1.1.3 Which are categorical or factor variables (numeric or character variables that have a fixed and known set of possible values)?

Categorical variables: host_verifications, host_has_profile_pic, host_identity_verified, neighbourhood, neighbourhood_cleansed, property_type, room_type, has_availability and instant_bookable.

listings <- listings %>% 
  mutate(price = parse_number(price),
         bathrooms = parse_number(bathrooms_text))
typeof(listings$price)
[1] "double"
typeof(listings$bathrooms)
[1] "double"
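To illustrate what this conversion does, here is a base-R approximation of parse_number() for these two columns. The helper name to_number is hypothetical, and apart from "$90.00", "$74.00" and "1 bath", the example strings are made up.

```r
# Rough base-R stand-in for readr::parse_number():
# strip everything except digits and the decimal point, then convert.
to_number <- function(x) as.numeric(gsub("[^0-9.]", "", x))

price_text <- c("$90.00", "$74.00", "$1,250.00")            # last value is hypothetical
bath_text  <- c("1 bath", "1.5 shared baths", "Half-bath")  # last two are hypothetical

to_number(price_text)
to_number(bath_text)   # text without any digits becomes NA
```

Text with no digits in it turns into NA, which is one possible reason why bathrooms ends up with more missing values (31) than bathrooms_text (12).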

2.1.2 Step 2: Computing summary statistics of the variables of interest, or finding NAs

# Variables of interest
new_listings <- listings %>% 
  select(host_since, host_location, host_response_time, host_response_rate,
         host_is_superhost, host_neighbourhood, host_listings_count,
         host_total_listings_count, host_has_profile_pic, host_identity_verified,
         neighbourhood_cleansed, latitude, longitude, property_type, room_type,
         accommodates, bathrooms, bedrooms, beds, price, minimum_nights,
         maximum_nights, minimum_nights_avg_ntm, maximum_nights_avg_ntm,
         has_availability, number_of_reviews, review_scores_rating,
         instant_bookable, availability_30, reviews_per_month)

skim(new_listings)
Data summary
Name new_listings
Number of rows 5442
Number of columns 30
_______________________
Column type frequency:
character 7
Date 1
logical 5
numeric 17
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
host_location 16 1.00 2 70 0 405 0
host_response_time 2 1.00 3 18 0 5 0
host_response_rate 2 1.00 2 4 0 55 0
host_neighbourhood 2028 0.63 5 29 0 88 0
neighbourhood_cleansed 0 1.00 5 21 0 19 0
property_type 0 1.00 4 35 0 45 0
room_type 0 1.00 10 15 0 4 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
host_since 2 1 2008-08-28 2021-09-19 2015-10-19 2100

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 2 1 0.20 FAL: 4366, TRU: 1074
host_has_profile_pic 2 1 0.99 TRU: 5394, FAL: 46
host_identity_verified 2 1 0.87 TRU: 4741, FAL: 699
has_availability 0 1 0.99 TRU: 5382, FAL: 60
instant_bookable 0 1 0.37 FAL: 3446, TRU: 1996

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
host_listings_count 2 1.00 9.64 39.82 0.00 1.00 1.00 4.00 2044.00 ▇▁▁▁▁
host_total_listings_count 2 1.00 9.64 39.82 0.00 1.00 1.00 4.00 2044.00 ▇▁▁▁▁
latitude 0 1.00 50.84 0.02 50.77 50.83 50.84 50.85 50.90 ▁▃▇▂▁
longitude 0 1.00 4.36 0.03 4.26 4.34 4.36 4.38 4.48 ▁▃▇▂▁
accommodates 0 1.00 3.01 1.77 0.00 2.00 2.00 4.00 16.00 ▇▃▁▁▁
bathrooms 31 0.99 1.19 0.56 0.00 1.00 1.00 1.00 19.50 ▇▁▁▁▁
bedrooms 630 0.88 1.40 1.05 1.00 1.00 1.00 2.00 40.00 ▇▁▁▁▁
beds 83 0.98 1.71 1.26 0.00 1.00 1.00 2.00 16.00 ▇▁▁▁▁
price 0 1.00 87.13 132.37 0.00 46.00 65.00 92.00 5000.00 ▇▁▁▁▁
minimum_nights 0 1.00 10.29 36.19 1.00 1.00 2.00 4.00 1125.00 ▇▁▁▁▁
maximum_nights 0 1.00 2339.13 120486.31 1.00 90.00 1125.00 1125.00 8888888.00 ▇▁▁▁▁
minimum_nights_avg_ntm 1 1.00 10.27 35.98 1.00 1.00 2.00 4.10 1125.00 ▇▁▁▁▁
maximum_nights_avg_ntm 1 1.00 2472.10 120495.39 1.00 365.00 1125.00 1125.00 8888888.00 ▇▁▁▁▁
number_of_reviews 0 1.00 35.37 69.70 0.00 2.00 8.00 35.00 782.00 ▇▁▁▁▁
review_scores_rating 914 0.83 4.59 0.65 0.00 4.50 4.75 4.92 5.00 ▁▁▁▁▇
availability_30 0 1.00 9.09 10.77 0.00 0.00 3.00 19.00 30.00 ▇▂▁▂▂
reviews_per_month 914 0.83 1.37 1.67 0.01 0.27 0.77 1.84 12.34 ▇▁▁▁▁

2.1.2.1 Skim Summary of variables of interest

Based on the skim of the full listings data, the missing values are: 229 in description; 2230 each in neighborhood_overview and neighbourhood; 2656 in host_about; 2028 in host_neighbourhood; 16 in host_location; 12 in bathrooms_text; 914 each in first_review and last_review; 2 each in host_name, host_since, host_response_time, host_response_rate, host_acceptance_rate, host_thumbnail_url, host_picture_url, host_is_superhost, host_has_profile_pic and host_identity_verified; and all 5442 in neighbourhood_group_cleansed, bathrooms (before parsing), calendar_updated and license. Among the variables of interest in new_listings, 16 are missing in host_location, 2028 in host_neighbourhood, and 2 each in host_since, host_response_time, host_response_rate, host_is_superhost, host_has_profile_pic and host_identity_verified. The numeric variables of interest have 31 missing in bathrooms (after parsing), 630 in bedrooms, 83 in beds, 914 each in review_scores_rating and reviews_per_month, and 1 each in minimum_nights_avg_ntm and maximum_nights_avg_ntm.
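The same counts can be pulled out programmatically rather than read off the skim tables. A minimal base-R sketch on a hypothetical three-row data frame (the column names mirror the report, the values do not):

```r
# Hypothetical miniature of new_listings
toy <- data.frame(
  host_location        = c("Belgium", NA, "Brussels"),
  bedrooms             = c(2, NA, NA),
  review_scores_rating = c(4.44, 4.00, NA),
  price                = c(90, 74, 95)
)

# Number of missing values per column, keeping only columns that have any
n_missing <- colSums(is.na(toy))
sort(n_missing[n_missing > 0], decreasing = TRUE)
```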

2.2 Property types

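# Count listings per property type and compute each type's share of the total
new_listings %>% 
  count(property_type) %>% 
  arrange(desc(n)) %>% 
  pivot_wider(names_from = property_type, values_from = n) %>% 
  mutate(total = rowSums(.)) %>% 
  # pivot every property-type column back to long form, keeping total
  pivot_longer(cols = -total, names_to = 'property_type', values_to = 'count') %>% 
  mutate(proportion = count / total)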
total     property_type                        count  proportion
5.44e+03  Entire rental unit                   2866   0.527
5.44e+03  Private room in rental unit          716    0.132
5.44e+03  Entire condominium (condo)           298    0.0548
5.44e+03  Private room in residential home     283    0.052
5.44e+03  Entire serviced apartment            219    0.0402
5.44e+03  Private room in townhouse            165    0.0303
5.44e+03  Entire residential home              159    0.0292
5.44e+03  Entire loft                          144    0.0265
5.44e+03  Private room in bed and breakfast    106    0.0195
5.44e+03  Private room in condominium (condo)  81     0.0149
5.44e+03  Entire townhouse                     65     0.0119
5.44e+03  Room in hotel                        62     0.0114
5.44e+03  Private room in loft                 39     0.00717
5.44e+03  Room in boutique hotel               30     0.00551
5.44e+03  Private room in guesthouse           27     0.00496
5.44e+03  Shared room in rental unit           26     0.00478
5.44e+03  Room in bed and breakfast            23     0.00423
5.44e+03  Entire guesthouse                    18     0.00331
5.44e+03  Entire guest suite                   17     0.00312
5.44e+03  Private room in guest suite          15     0.00276
5.44e+03  Room in aparthotel                   15     0.00276
5.44e+03  Private room in villa                11     0.00202
5.44e+03  Private room in casa particular      7      0.00129
5.44e+03  Entire villa                         6      0.0011
5.44e+03  Private room                         6      0.0011
5.44e+03  Private room in nature lodge         5      0.000919
5.44e+03  Tiny house                           5      0.000919
5.44e+03  Private room in serviced apartment   4      0.000735
5.44e+03  Room in serviced apartment           4      0.000735
5.44e+03  Shared room in condominium (condo)   3      0.000551
5.44e+03  Entire bed and breakfast             2      0.000368
5.44e+03  Private room in tiny house           2      0.000368
5.44e+03  Barn                                 1      0.000184
5.44e+03  Entire cottage                       1      0.000184
5.44e+03  Entire place                         1      0.000184
5.44e+03  Floor                                1      0.000184
5.44e+03  Private room in barn                 1      0.000184
5.44e+03  Private room in castle               1      0.000184
5.44e+03  Private room in dome house           1      0.000184
5.44e+03  Private room in farm stay            1      0.000184
5.44e+03  Private room in floor                1      0.000184
5.44e+03  Private room in hostel               1      0.000184
5.44e+03  Shared room                          1      0.000184
5.44e+03  Shared room in residential home      1      0.000184
5.44e+03  Shared room in serviced apartment    1      0.000184

2.2.0.1 What are the top 4 most common property types? What proportion of the total listings do they make up?

The top 4 property types are ‘Entire rental unit’, ‘Private room in rental unit’, ‘Entire condominium (condo)’ and ‘Private room in residential home’, with proportions of 52.7%, 13.2%, 5.48% and 5.20% respectively. Together they account for 4,163 of the 5,442 listings, or about 76.5%.

# Keep the 4 most common property types and group everything else as "Other"
new_listings <- new_listings %>%
  mutate(prop_type_simplified = case_when(
    property_type %in% c("Entire rental unit", "Private room in rental unit", "Entire condominium (condo)", "Private room in residential home") ~ property_type, 
    TRUE ~ "Other"
  ))
# Check that the mapping worked as intended
new_listings %>%
  count(property_type, prop_type_simplified) %>%
  arrange(desc(n))
property_type                        prop_type_simplified              n
Entire rental unit                   Entire rental unit                2866
Private room in rental unit          Private room in rental unit       716
Entire condominium (condo)           Entire condominium (condo)        298
Private room in residential home     Private room in residential home  283
Entire serviced apartment            Other                             219
Private room in townhouse            Other                             165
Entire residential home              Other                             159
Entire loft                          Other                             144
Private room in bed and breakfast    Other                             106
Private room in condominium (condo)  Other                             81
Entire townhouse                     Other                             65
Room in hotel                        Other                             62
Private room in loft                 Other                             39
Room in boutique hotel               Other                             30
Private room in guesthouse           Other                             27
Shared room in rental unit           Other                             26
Room in bed and breakfast            Other                             23
Entire guesthouse                    Other                             18
Entire guest suite                   Other                             17
Private room in guest suite          Other                             15
Room in aparthotel                   Other                             15
Private room in villa                Other                             11
Private room in casa particular      Other                             7
Entire villa                         Other                             6
Private room                         Other                             6
Private room in nature lodge         Other                             5
Tiny house                           Other                             5
Private room in serviced apartment   Other                             4
Room in serviced apartment           Other                             4
Shared room in condominium (condo)   Other                             3
Entire bed and breakfast             Other                             2
Private room in tiny house           Other                             2
Barn                                 Other                             1
Entire cottage                       Other                             1
Entire place                         Other                             1
Floor                                Other                             1
Private room in barn                 Other                             1
Private room in castle               Other                             1
Private room in dome house           Other                             1
Private room in farm stay            Other                             1
Private room in floor                Other                             1
Private room in hostel               Other                             1
Shared room                          Other                             1
Shared room in residential home      Other                             1
Shared room in serviced apartment    Other                             1

Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels. We only want to include listings in our regression analysis that are intended for travel purposes:

  • What are the most common values for the variable minimum_nights?
  • Is there any value among the common values that stands out?
  • What is the likely intended purpose for Airbnb listings with this seemingly unusual value for minimum_nights?
new_listings %>% 
  mutate(minimum_nights = as.factor(minimum_nights)) %>% 
  group_by(minimum_nights) %>% 
  count() %>% 
  arrange(desc(n))
# A tibble: 61 x 2
# Groups:   minimum_nights [61]
   minimum_nights     n
   <fct>          <int>
 1 1               1656
 2 2               1528
 3 3                661
 4 5                275
 5 4                257
 6 7                233
 7 90               131
 8 30               105
 9 6                 76
10 14                72
# ... with 51 more rows

2.2.0.2 What are the most common values for the variable minimum_nights?

The most common value is 1 night (1,656 listings), followed closely by 2 nights (1,528 listings).

2.2.0.3 Is there any value among the common values that stands out?

The values 90 and 30 stand out: every other common value is 14 nights or fewer.

2.2.0.4 What is the likely intended purpose for Airbnb listings with this seemingly unusual value for minimum_nights?

The unusual values correspond to either 1 month (30 nights) or 1 quarter (90 nights), which suggests that these hosts intend to let their property on a long-term basis rather than to travellers.

new_listings <- new_listings %>% 
  filter(minimum_nights <= 4) # keep only listings with a minimum stay of at most 4 nights

3 Mapping

leaflet(data = filter(listings, minimum_nights <= 4)) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   fillColor = "blue", 
                   fillOpacity = 0.4, 
                   popup = ~listing_url,
                   label = ~property_type)

Step 3: Creating informative visualizations.

  • ggplot2::ggplot()
  • geom_histogram() or geom_density() for numeric continuous variables
  • geom_bar() or geom_col() for categorical variables
  • GGally::ggpairs() for a scatterplot/correlation matrix
  • Note that you can add transparency to points/density plots in the aes call, for example: aes(colour = gender, alpha = 0.4)

3.1 Visualizations

In this section, we use the visualisation tools we have learnt so far to answer some questions we have come up with.

3.1.1 What type of room is most common in Brussels (In terms of Number of Rooms) and on average how many people do these room types accommodate?

# Creating the data table that we will use to plot the top room frequency graph
top_room_type <- new_listings %>% 
  group_by(room_type) %>% 
  summarise(room_type_count = n())  # n() counts the rows in each room_type group

# Creating the data table that we will use to plot the top average accommodating room type graph
average_number_accomodated_by_room_type <- new_listings %>% 
  group_by(room_type) %>% 
  summarise(average_accomodated = mean(accommodates))


room_type_bar_graph <- top_room_type %>% 
  ggplot(aes(x = room_type_count, y = fct_reorder(room_type, room_type_count))) +
  geom_col(fill='yellow') +
  theme_bw()+
  labs(
    title = "What type of listings (by room type) are most common in Brussels?",
    subtitle = NULL,
    x = "Count",
    y = NULL )

room_type_bar_graph

average_number_accomodated_bar_graph <- average_number_accomodated_by_room_type %>% 
   ggplot(aes(x = average_accomodated, y = fct_reorder(room_type, average_accomodated))) +
   geom_col(fill='blue') +
   theme_bw()+
   labs(
    title = "On average, which type of room accommodates the most people?",
    subtitle = NULL,
    x = "Average Number Accommodated",
    y = NULL )

average_number_accomodated_bar_graph

3.1.1.1 Comments and Analysis

The objective of this first question was to determine the nature of AirBnB listings (by room type) and, as a result, to understand the nature of house ownership in Brussels. Furthermore, these graphs help us conjecture, qualitatively, whether capacity (the number of people accommodated) has any bearing on how frequently a room type is listed on AirBnB.
We hypothesised that, to an extent, the more people a room type accommodates, the more frequently it appears as an AirBnB listing. The results suggest that this is partially true: entire apartments are both the most common listings and the ones that accommodate the most people. This makes sense, as hosts can charge somewhat more for larger room types. The relationship does not hold for hotel rooms, which also makes sense: hotel rooms are generally listed on the hotels' own websites rather than on AirBnB.

3.1.2 What is the AirBnB Price distribution in Brussels?

price_by_prop_type_histo <- ggplot(new_listings, aes(x = price))+ 
  geom_bar(fill = 'violet') +
  theme_bw()+
  xlim(0,300)+
  facet_wrap(~ room_type)+
  labs(title = "AirBnB listings in Brussels' price distribution",
         x = "Price", 
         y = "")
price_by_prop_type_histo 

price_by_prop_type_density <- ggplot(new_listings, aes(x = price))+ 
  geom_density(fill = 'orange')+
  theme_bw()+
  xlim(0,300)+
  facet_wrap(~ room_type)+
  labs(title = "AirBnB listings in Brussels' price distribution",
         x = "Price", 
         y = "")
price_by_prop_type_density

3.1.2.1 Comments and Analysis

The objective here was to identify how listing prices are distributed when the listings are grouped by room type. All the distributions are right-skewed and (more or less) multi-modal. The implication of the right skew is that the mean price is greater than the median: some listings are priced well above the typical price and stretch the right tail of the distribution. This is a result we would expect, particularly given the presence of luxury listings.

3.1.4 How is Host quality, measured by whether a Host is a Super Host or not, correlated with Price?

# Converting a logical into binary variable
cols <- sapply(new_listings, is.logical)
new_listings[,cols] <- lapply(new_listings[,cols], as.numeric)

ggplot(new_listings, aes(x = host_is_superhost , y= price))+ 
  geom_point(colour = 'red') +
  theme_bw()+
  labs(title = "Relationship between a host being a superhost and the price they charge",
         x = "Superhost? (1 = Yes, 0 = No)", 
         y = "Price")

dummy_model_quality <- lm(price ~ host_is_superhost, data = new_listings)
summary_dummy <- summary(dummy_model_quality)
print(summary_dummy)

Call:
lm(formula = price ~ host_is_superhost, data = new_listings)

Residuals:
   Min     1Q Median     3Q    Max 
 -93.9  -42.9  -23.9    6.1 4906.1 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         93.889      2.403  39.067   <2e-16 ***
host_is_superhost  -13.344      5.431  -2.457   0.0141 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 138 on 4099 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.00147,   Adjusted R-squared:  0.001227 
F-statistic: 6.036 on 1 and 4099 DF,  p-value: 0.01406

3.1.4.1 Comments and Analysis

Besides drawing a scatter plot and conjecturing about the relationship between the x and y variables, the objective of this question was to convert a logical variable into a binary (0/1) regressor and then interpret the results of a linear model. With a single binary regressor, the coefficients are not read as a usual slope: the intercept is the average price charged by non-superhosts, and the coefficient on host_is_superhost is the difference between the average superhost price and the average non-superhost price. Our hypothesis was that, on average, superhosts charge more for listings than non-superhosts. Instead, we find that, on average, superhosts charge $13.34 less than non-superhosts, a difference that is statistically significant at the 5% level (p = 0.014). This is an interesting result, although the model explains almost none of the variation in price (R-squared of 0.0015). There are likely to be various confounding variables, so what we have here is far from causality. Based solely on this bivariate model, one possible story is that hosts become superhosts partly because they do not overcharge.
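To illustrate why a regression on a single 0/1 dummy can be read as a comparison of group means, here is a minimal sketch on simulated data. The variable names and the numbers used to generate the data are assumptions made for illustration only; they are not the Brussels listings.

```r
# Simulated data (hypothetical values, not the Brussels listings)
set.seed(42)
n <- 200
is_superhost <- rbinom(n, size = 1, prob = 0.3)        # 0/1 dummy
price <- 95 - 13 * is_superhost + rnorm(n, sd = 40)    # made-up price process

fit <- lm(price ~ is_superhost)

# Group means computed directly from the data
mean_non_super <- mean(price[is_superhost == 0])
mean_super     <- mean(price[is_superhost == 1])

# The intercept equals the mean of the 0 group and the
# slope equals the difference between the two group means
coef(fit)
c(intercept = mean_non_super, slope = mean_super - mean_non_super)
```

The two printed vectors should agree up to floating-point error, which is why the coefficient in the output above can be read directly as "superhosts charge this much more or less on average".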

3.1.5 Analysing multiple correlations with the help of GGpairs

In the following analysis, we use the GGpairs plot to qualitatively answer a set of questions based on relationships between two variables.

# GGpairs plot to answer it all
new_listings %>% 
  select(price, minimum_nights, maximum_nights, beds, host_identity_verified, review_scores_rating) %>% 
  ggpairs(aes(alpha=0.1))+
  theme_bw()

3.1.5.1 What is the relationship between the minimum number of nights you can book a listing and the listing's price?

We would conjecture that as the minimum number of nights increases, the listing price decreases, and this is what we observe (a correlation, not causation). The negative correlation of -0.043, albeit very weak, makes sense: a higher minimum number of nights is a restriction on customers that the host has to compensate for with a lower price.

3.1.5.2 What is the relationship between the number of beds and listing’s price?

We would expect the correlation between the number of beds and price to be positive. The results show a weak positive correlation of 0.252. The logic is fairly intuitive: more beds imply a bigger house or unit, and the customer would expect to pay more for a bigger unit. The correlation may not be stronger, however, because of confounding factors we have not accounted for, such as location.

3.1.5.3 What is the relationship between whether a host is verified and the listing’s price?

We would expect that a verified host has more credibility and can thus charge a higher price. Oddly, there is a very weak negative correlation of -0.009, which is effectively zero. Again, there are numerous confounders that stop us from making conclusive statements. Controlling for other variables in a regression might reveal a positive relationship.
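The ggpairs plot reports correlations only as point estimates. One way to check whether a very small correlation is distinguishable from zero is base R's cor.test(). The sketch below runs on simulated data in which the two variables are generated independently; the names and parameters are assumptions for illustration, not the report's data.

```r
# Simulated, independent variables (hypothetical, for illustration only)
set.seed(2021)
n <- 4000
verified_sim <- rbinom(n, size = 1, prob = 0.8)        # 0/1 "verified" flag
price_sim <- rlnorm(n, meanlog = 4.4, sdlog = 0.6)     # generated independently of the flag

# Pearson correlation with a test of H0: correlation = 0
ct <- cor.test(price_sim, verified_sim)

ct$estimate   # sample correlation, expected to be close to zero here
ct$p.value    # p-value for the null hypothesis of zero correlation
ct$conf.int   # 95% confidence interval for the correlation
```

If the confidence interval comfortably includes zero, the sign of the sample correlation should not be over-interpreted.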

4 Regression Analysis

#Creation of variable price_4_nights
new_listings <- new_listings %>%
  filter(accommodates > 1) %>%
  mutate(price_4_nights = price * 4)
#creating new variables called `neighbourhood_simplified` for later regression

new_listings <- new_listings %>%
  mutate(neighbourhood_simplified = case_when(neighbourhood_cleansed %in% c("Jette","Berchem-Sainte-Agathe","Koekelberg", "Molenbeek-Saint-Jean", 
                                       "Ganshoren") ~ "North West", 
                                       neighbourhood_cleansed %in% c( "Saint-Josse-ten-Noode", "Schaerbeek",  "Bruxelles", "Evere") ~ "North East", 
                                      neighbourhood_cleansed %in% c("Woluwe-Saint-Lambert", "Woluwe-Saint-Pierre","Auderghem", "Etterbeek") ~ "East/Centre",
                                      neighbourhood_cleansed %in% c("Saint-Gilles", "Anderlecht", "Forest") ~ " West/Centre",
                                      neighbourhood_cleansed %in% c("Ixelles", "Uccle", "Watermael-Boitsfort") ~ "South/Centre"))
#Creation of new variable log-Price_4_nights
new_listings <- new_listings %>%
  mutate(log_price_4_nights = log(price_4_nights))

#Creating a histogram to examine distribution of price_4_nights
ggplot(data = new_listings, aes(x = price_4_nights)) +
  geom_histogram(color = "white", fill = "steelblue") +
  theme_bw() +
  labs(title = "Distribution of price_4_nights in histogram graph",
         x = "price_4_nights", 
         y = "")

#Creating a histogram to examine distribution of log(price_4_nights)
ggplot(data = new_listings, aes(x = log_price_4_nights)) +
  geom_histogram(color = "white", fill = "steelblue") +
  theme_bw() +
  labs(title = "Distribution of log(price_4_nights) in histogram graph",
         x = "log(price_4_nights)", 
         y = "")

4.0.0.1 Which variable should you use for the regression model? Why?

We should use log(price_4_nights) as the explained variable: price_4_nights is strongly right-skewed, whereas the distribution of its logarithm is approximately normal.
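The visual comparison of the two histograms can be backed up with a simple skewness measure. The sketch below uses simulated log-normal prices and a hand-written `skewness()` helper; both are assumptions introduced for illustration and are not part of the original analysis.

```r
# Simulated right-skewed "prices" (hypothetical parameters)
set.seed(1)
price_sim <- rlnorm(5000, meanlog = 5.8, sdlog = 0.6)

# Sample skewness: third central moment divided by the variance to the power 1.5
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew_raw <- skewness(price_sim)        # clearly positive: long right tail
skew_log <- skewness(log(price_sim))   # close to zero: roughly symmetric
c(raw = skew_raw, log = skew_log)
```

A skewness near zero after the log transform is consistent with the roughly bell-shaped histogram of log(price_4_nights).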

4.1 Set training and testing dataset

library(rsample)
set.seed(1234)

#new_listings <- new_listings %>% na.omit()  #drop na
train_test_split <- initial_split(new_listings, prop = 0.7)
train_data <- training(train_test_split)
test_data <- testing(train_test_split)

4.2 Model 1

# Checking the categories of prop_type_simplified
new_listings %>%
  group_by(prop_type_simplified) %>%
  summarise(n = n()) %>% 
  arrange(desc(n))
prop_type_simplified                 n
Entire rental unit                2065
Other                              906
Private room in rental unit        451
Entire condominium (condo)         205
Private room in residential home   166
# Fit regression model
model1 <-lm(log(price_4_nights) ~ prop_type_simplified + number_of_reviews + review_scores_rating, data = train_data)

msummary(model1)
                                                       Estimate Std. Error
(Intercept)                                           5.8825214  0.0956056
prop_type_simplifiedEntire rental unit               -0.0170932  0.0522687
prop_type_simplifiedOther                             0.1614693  0.0551674
prop_type_simplifiedPrivate room in rental unit      -0.5634344  0.0598195
prop_type_simplifiedPrivate room in residential home -0.4304863  0.0718322
number_of_reviews                                    -0.0006096  0.0001336
review_scores_rating                                 -0.0239373  0.0174873
                                                     t value Pr(>|t|)    
(Intercept)                                           61.529  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -0.327  0.74368    
prop_type_simplifiedOther                              2.927  0.00346 ** 
prop_type_simplifiedPrivate room in rental unit       -9.419  < 2e-16 ***
prop_type_simplifiedPrivate room in residential home  -5.993 2.39e-09 ***
number_of_reviews                                     -4.563 5.32e-06 ***
review_scores_rating                                  -1.369  0.17118    

Residual standard error: 0.5301 on 2279 degrees of freedom
  (369 observations deleted due to missingness)
Multiple R-squared:  0.1565,    Adjusted R-squared:  0.1542 
F-statistic: 70.45 on 6 and 2279 DF,  p-value: < 2.2e-16
autoplot(model1)

4.2.0.1 Interpreting the coefficient review_scores_rating in terms of price_4_nights.

At first glance, there is a negative relationship between review_scores_rating and price_4_nights, which seems strange, as we would normally expect properties with higher ratings to command higher prices. The coefficient of -0.024 implies that a one-point increase in the rating is associated with a roughly 2.4% lower price for 4 nights. However, the effect is small and not statistically significant (p = 0.171), so at the 5% level we cannot reject the hypothesis that review_scores_rating has no effect on price_4_nights.

4.2.0.2 Interpreting the coefficient of prop_type_simplified in terms of price_4_nights.

prop_type_simplified is a categorical variable, so the first thing to note is that the regression uses Entire condominium (condo) as the baseline. The intercept of 5.883 is the predicted log(price_4_nights) for an entire condo with zero reviews and a review score of zero. For a private room in a rental unit or a private room in a residential home, the log price is lower by 0.563 and 0.430 respectively, which corresponds to prices roughly 43% and 35% below that of an entire condo. This makes sense, as renting a room costs less than renting an entire condo. The coefficient for Entire rental unit is not statistically significant (p = 0.744), so we cannot distinguish its price from that of an entire condo.
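Because the explained variable is in logs, a dummy coefficient b translates into a percentage change in price of 100 * (exp(b) - 1). The short sketch below applies this to two coefficients copied from the model1 output above; the object names are hypothetical.

```r
# Coefficients copied from the model1 output (baseline: Entire condominium (condo))
log_coefs <- c(
  "Private room in rental unit"      = -0.5634344,
  "Private room in residential home" = -0.4304863
)

# Percentage change in price_4_nights relative to the baseline category
pct_change <- 100 * (exp(log_coefs) - 1)
round(pct_change, 1)
```

For small coefficients the percentage change is close to 100 * b, but for coefficients of this size the exact formula gives a noticeably smaller discount than simply reading off the coefficient.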

4.2.1 Checking for Overfitting

# testing overfit
RMSE_model1 <- test_data %>% 
  mutate(predictions = predict(model1, .),
         R = predictions - log_price_4_nights) %>% # `.` passes the piped-in test_data to predict()
  select(R) %>% 
  na.omit() %>%  # omit all the NA values in residual
  summarise(RMSE = sqrt(sum(R**2 / n()))) %>% 
  pull()
RMSE_model1
[1] 0.5181594
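The same RMSE calculation is repeated for every model, so it could be wrapped in a small helper. The sketch below is an illustration on simulated data, not the code used for the reported numbers; `rmse()`, `sim`, `train_sim` and `test_sim` are hypothetical names. A test RMSE close to the training error suggests little overfitting; for model1, the test RMSE of 0.518 is close to the residual standard error of 0.530 on the training data.

```r
# Hypothetical helper: root mean squared error, ignoring missing residuals
rmse <- function(actual, predicted) {
  resid <- predicted - actual
  sqrt(mean(resid^2, na.rm = TRUE))
}

# Simulated data standing in for the listings
set.seed(1234)
sim <- data.frame(x = rnorm(300))
sim$y <- 1 + 0.5 * sim$x + rnorm(300, sd = 0.5)

# 70/30 split done by hand in base R
train_idx <- sample(nrow(sim), size = round(0.7 * nrow(sim)))
train_sim <- sim[train_idx, ]
test_sim  <- sim[-train_idx, ]

fit <- lm(y ~ x, data = train_sim)

# Compare in-sample and out-of-sample error
rmse_train <- rmse(train_sim$y, predict(fit, train_sim))
rmse_test  <- rmse(test_sim$y,  predict(fit, test_sim))
c(train = rmse_train, test = rmse_test)
```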

4.3 Model 2

We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. Fit a regression model called model2 that includes all of the explanatory variables in model1 plus room_type.

Since review_scores_rating was not a significant variable in model1, we do not include it in model2.

# Fit regression model
model2 <-lm (log_price_4_nights ~ prop_type_simplified + number_of_reviews + room_type, data = train_data)

msummary(model2)
                                                       Estimate Std. Error
(Intercept)                                           5.7926328  0.0447861
prop_type_simplifiedEntire rental unit               -0.0179112  0.0464776
prop_type_simplifiedOther                             0.4798858  0.0526092
prop_type_simplifiedPrivate room in rental unit       0.1145401  0.0673437
prop_type_simplifiedPrivate room in residential home  0.2094071  0.0763455
number_of_reviews                                    -0.0008301  0.0001243
room_typeHotel room                                  -0.1110830  0.0968612
room_typePrivate room                                -0.6535518  0.0416335
room_typeShared room                                 -1.2191422  0.1491773
                                                     t value Pr(>|t|)    
(Intercept)                                          129.340  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -0.385  0.69999    
prop_type_simplifiedOther                              9.122  < 2e-16 ***
prop_type_simplifiedPrivate room in rental unit        1.701  0.08909 .  
prop_type_simplifiedPrivate room in residential home   2.743  0.00613 ** 
number_of_reviews                                     -6.676 2.99e-11 ***
room_typeHotel room                                   -1.147  0.25156    
room_typePrivate room                                -15.698  < 2e-16 ***
room_typeShared room                                  -8.172 4.63e-16 ***

Residual standard error: 0.5075 on 2646 degrees of freedom
Multiple R-squared:  0.2394,    Adjusted R-squared:  0.2371 
F-statistic: 104.1 on 8 and 2646 DF,  p-value: < 2.2e-16

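Judging by the t-statistics, room_type is a significant predictor of price: the coefficients for Private room and Shared room are both highly significant relative to the baseline room type, and only the coefficient for Hotel room is not (p = 0.252).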
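Individual t-statistics test one level of a factor at a time. To test whether a multi-level factor such as room_type matters as a whole, one option is a partial F-test comparing nested models with anova(). The sketch below uses simulated data; the variable names and effect sizes are assumptions for illustration and this test was not part of the original analysis.

```r
# Simulated data with a four-level factor (hypothetical effect sizes)
set.seed(7)
n <- 400
room_levels <- c("Entire home/apt", "Hotel room", "Private room", "Shared room")
room_type_sim <- factor(sample(room_levels, n, replace = TRUE,
                               prob = c(0.6, 0.05, 0.3, 0.05)))
reviews_sim <- rpois(n, lambda = 20)
effect <- c("Entire home/apt" = 0, "Hotel room" = -0.1,
            "Private room" = -0.65, "Shared room" = -1.2)
log_price_sim <- 5.8 + unname(effect[as.character(room_type_sim)]) -
  0.001 * reviews_sim + rnorm(n, sd = 0.5)

# Nested models: without and with the factor
reduced <- lm(log_price_sim ~ reviews_sim)
full    <- lm(log_price_sim ~ reviews_sim + room_type_sim)

# Partial F-test: do the three room_type dummies jointly improve the fit?
f_test <- anova(reduced, full)
f_test
p_value <- f_test[["Pr(>F)"]][2]
```

A small p-value here means the factor as a whole adds explanatory power, even if one of its levels (as with Hotel room above) is not individually significant.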

4.3.1 Checking for Overfitting

# testing overfit
RMSE_model2 <- test_data %>% 
  mutate(predictions = predict(model2, .),
         R = predictions - log_price_4_nights) %>% # `.` passes the piped-in test_data to predict()
  select(R) %>% 
  na.omit() %>%  # omit all the NA values in residual
  summarise(RMSE = sqrt(sum(R**2 / n()))) %>% 
  pull()
RMSE_model2
[1] 0.5041794
autoplot(model2)

4.4 Extending our Analysis: Our Models

4.5 Model 3

# Fit regression model
model3 <-lm (log(price_4_nights) ~ prop_type_simplified + number_of_reviews + room_type + bathrooms + bedrooms + beds + accommodates  , data = train_data)

msummary(model3)
                                                       Estimate Std. Error
(Intercept)                                           5.2901290  0.0508939
prop_type_simplifiedEntire rental unit               -0.0074623  0.0463288
prop_type_simplifiedOther                             0.2371531  0.0523933
prop_type_simplifiedPrivate room in rental unit      -0.1082859  0.0646830
prop_type_simplifiedPrivate room in residential home -0.0282529  0.0725612
number_of_reviews                                    -0.0007703  0.0001178
room_typeHotel room                                   0.2332800  0.0990694
room_typePrivate room                                -0.3025114  0.0412916
room_typeShared room                                 -0.8132468  0.1340096
bathrooms                                             0.0551362  0.0181127
bedrooms                                              0.0363091  0.0124805
beds                                                 -0.0059722  0.0119793
accommodates                                          0.1202730  0.0093949
                                                     t value Pr(>|t|)    
(Intercept)                                          103.944  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -0.161  0.87205    
prop_type_simplifiedOther                              4.526 6.30e-06 ***
prop_type_simplifiedPrivate room in rental unit       -1.674  0.09425 .  
prop_type_simplifiedPrivate room in residential home  -0.389  0.69704    
number_of_reviews                                     -6.539 7.59e-11 ***
room_typeHotel room                                    2.355  0.01862 *  
room_typePrivate room                                 -7.326 3.24e-13 ***
room_typeShared room                                  -6.069 1.50e-09 ***
bathrooms                                              3.044  0.00236 ** 
bedrooms                                               2.909  0.00366 ** 
beds                                                  -0.499  0.61815    
accommodates                                          12.802  < 2e-16 ***

Residual standard error: 0.4511 on 2333 degrees of freedom
  (309 observations deleted due to missingness)
Multiple R-squared:  0.4159,    Adjusted R-squared:  0.4129 
F-statistic: 138.4 on 12 and 2333 DF,  p-value: < 2.2e-16

4.5.1 Checking for Colinearity

car::vif(model3)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 4.111669  4        1.193307
number_of_reviews    1.009273  1        1.004626
room_type            4.278923  3        1.274155
bathrooms            1.577555  1        1.256008
bedrooms             1.871598  1        1.368064
beds                 3.049900  1        1.746396
accommodates         3.438294  1        1.854264

4.5.1.1 Are the number of bathrooms, bedrooms, beds, or size of the house (accommodates) significant predictors of price_4_nights? Or might these be collinear variables?

The number of beds is not a significant predictor of price_4_nights (p = 0.618). However, the number of bedrooms, the number of bathrooms and the size of the house (accommodates) are significant predictors. All (G)VIF values are below 5, so multicollinearity does not appear to be an issue.
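For a numeric predictor, the VIF is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes this by hand on simulated data with deliberately correlated size variables. `vif_manual()` and the simulated variables are hypothetical, and the helper only handles numeric predictors; for factors, car::vif() reports the generalised VIF shown above.

```r
# Simulated, deliberately correlated size variables (hypothetical)
set.seed(99)
n <- 500
bedrooms_sim     <- rpois(n, 1.5) + 1
beds_sim         <- bedrooms_sim + rpois(n, 0.5)
accommodates_sim <- 2 * beds_sim + rpois(n, 1)
y_sim <- 5 + 0.1 * accommodates_sim + rnorm(n, sd = 0.4)

fit <- lm(y_sim ~ bedrooms_sim + beds_sim + accommodates_sim)

# VIF for each numeric predictor: regress it on the others, then 1 / (1 - R^2)
vif_manual <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]   # drop the intercept column
  sapply(colnames(X), function(v) {
    others <- X[, colnames(X) != v, drop = FALSE]
    r2 <- summary(lm(X[, v] ~ others))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(fit)
vifs
```

Values near 1 indicate a predictor that is almost unrelated to the others, while larger values flag predictors whose coefficients are estimated less precisely because of overlap with the rest of the model.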

4.5.2 Checking for Overfitting

# testing overfit
RMSE_model3 <- test_data %>% 
  mutate(predictions = predict(model3, .),
         R = predictions - log_price_4_nights) %>% # `.` passes the piped-in test_data to predict()
  select(R) %>% 
  na.omit() %>%  # omit all the NA values in residual
  summarise(RMSE = sqrt(sum(R**2 / n()))) %>% 
  pull()
RMSE_model3
[1] 0.441413
autoplot(model3)

4.6 Model 4

4.6.0.1 Do superhosts (host_is_superhost) command a pricing premium, after controlling for other variables?

Since beds is not a significant variable, we discard it

# Fit regression model
model4 <-lm (log_price_4_nights ~ prop_type_simplified +review_scores_rating+ number_of_reviews + room_type + bathrooms + bedrooms + accommodates + host_is_superhost  , data = train_data)

msummary(model4)
                                                       Estimate Std. Error
(Intercept)                                           5.3313554  0.0893880
prop_type_simplifiedEntire rental unit                0.0050842  0.0486753
prop_type_simplifiedOther                             0.2335189  0.0548296
prop_type_simplifiedPrivate room in rental unit      -0.0342994  0.0687437
prop_type_simplifiedPrivate room in residential home  0.0863829  0.0763989
review_scores_rating                                 -0.0161321  0.0154985
number_of_reviews                                    -0.0006286  0.0001197
room_typeHotel room                                   0.2902014  0.1082220
room_typePrivate room                                -0.3808567  0.0443450
room_typeShared room                                 -0.7862292  0.1309131
bathrooms                                             0.0389981  0.0182821
bedrooms                                              0.0291645  0.0122714
accommodates                                          0.1279341  0.0072206
host_is_superhost                                    -0.0049142  0.0244515
                                                     t value Pr(>|t|)    
(Intercept)                                           59.643  < 2e-16 ***
prop_type_simplifiedEntire rental unit                 0.104  0.91682    
prop_type_simplifiedOther                              4.259 2.15e-05 ***
prop_type_simplifiedPrivate room in rental unit       -0.499  0.61787    
prop_type_simplifiedPrivate room in residential home   1.131  0.25832    
review_scores_rating                                  -1.041  0.29805    
number_of_reviews                                     -5.251 1.67e-07 ***
room_typeHotel room                                    2.682  0.00739 ** 
room_typePrivate room                                 -8.588  < 2e-16 ***
room_typeShared room                                  -6.006 2.25e-09 ***
bathrooms                                              2.133  0.03304 *  
bedrooms                                               2.377  0.01756 *  
accommodates                                          17.718  < 2e-16 ***
host_is_superhost                                     -0.201  0.84074    

Residual standard error: 0.4386 on 2010 degrees of freedom
  (631 observations deleted due to missingness)
Multiple R-squared:  0.4472,    Adjusted R-squared:  0.4436 
F-statistic: 125.1 on 13 and 2010 DF,  p-value: < 2.2e-16

4.6.1 Checking for Colinearity

car::vif(model4)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 4.306063  4        1.200218
review_scores_rating 1.049621  1        1.024510
number_of_reviews    1.050964  1        1.025165
room_type            4.404961  3        1.280335
bathrooms            1.559848  1        1.248939
bedrooms             1.776568  1        1.332879
accommodates         1.868890  1        1.367073
host_is_superhost    1.091315  1        1.044660

4.6.2 Checking for Overfitting

# testing overfit
RMSE_model4 <- test_data %>% 
  mutate(predictions = predict(model4, .),
         R = predictions - log_price_4_nights) %>% # `.` passes the piped-in test_data to predict()
  select(R) %>% 
  na.omit() %>%  # omit all the NA values in residual
  summarise(RMSE = sqrt(sum(R**2 / n()))) %>% 
  pull()
RMSE_model4
[1] 0.4181949

4.6.2.1 Key Comments

The coefficient on host_is_superhost is -0.005, i.e. superhosts charge about 0.5% less than comparable non-superhosts, and it is not statistically significant (p = 0.841). After controlling for the other variables, we therefore find no evidence that superhosts command a pricing premium.

autoplot(model4)

4.7 Model 5

# Fit regression model
model5 <-lm (log(price_4_nights) ~ prop_type_simplified + number_of_reviews  + room_type + bathrooms + bedrooms + 
               accommodates +  instant_bookable , data = train_data)

msummary(model5)
                                                       Estimate Std. Error
(Intercept)                                           5.2426611  0.0518372
prop_type_simplifiedEntire rental unit                0.0090906  0.0464155
prop_type_simplifiedOther                             0.2419394  0.0524729
prop_type_simplifiedPrivate room in rental unit      -0.0826566  0.0649542
prop_type_simplifiedPrivate room in residential home  0.0009518  0.0727467
number_of_reviews                                    -0.0008397  0.0001188
room_typeHotel room                                   0.1813531  0.1001331
room_typePrivate room                                -0.3121019  0.0414436
room_typeShared room                                 -0.8107628  0.1346822
bathrooms                                             0.0585907  0.0181705
bedrooms                                              0.0355044  0.0123689
accommodates                                          0.1175597  0.0069911
instant_bookable                                      0.0921779  0.0196610
                                                     t value Pr(>|t|)    
(Intercept)                                          101.137  < 2e-16 ***
prop_type_simplifiedEntire rental unit                 0.196  0.84474    
prop_type_simplifiedOther                              4.611 4.23e-06 ***
prop_type_simplifiedPrivate room in rental unit       -1.273  0.20331    
prop_type_simplifiedPrivate room in residential home   0.013  0.98956    
number_of_reviews                                     -7.069 2.05e-12 ***
room_typeHotel room                                    1.811  0.07025 .  
room_typePrivate room                                 -7.531 7.15e-14 ***
room_typeShared room                                  -6.020 2.02e-09 ***
bathrooms                                              3.224  0.00128 ** 
bedrooms                                               2.870  0.00414 ** 
accommodates                                          16.816  < 2e-16 ***
instant_bookable                                       4.688 2.91e-06 ***

Residual standard error: 0.4534 on 2350 degrees of freedom
  (292 observations deleted due to missingness)
Multiple R-squared:  0.4158,    Adjusted R-squared:  0.4128 
F-statistic: 139.4 on 12 and 2350 DF,  p-value: < 2.2e-16

4.7.1 Checking for Colinearity

car::vif(model5)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 4.218653  4        1.197145
number_of_reviews    1.017779  1        1.008850
room_type            4.353848  3        1.277847
bathrooms            1.573146  1        1.254251
bedrooms             1.822213  1        1.349894
accommodates         1.890973  1        1.375126
instant_bookable     1.053686  1        1.026492

4.7.2 Checking for Overfitting

# testing overfit
RMSE_model5 <- test_data %>% 
  mutate(predictions = predict(model5, .),
         R = predictions - log_price_4_nights) %>% # `.` passes the piped-in test_data to predict()
  select(R) %>% 
  na.omit() %>%  # omit all the NA values in residual
  summarise(RMSE = sqrt(sum(R**2 / n()))) %>% 
  pull()
RMSE_model5
[1] 0.4411484
autoplot(model5)

4.7.2.1 After controlling for other variables, is instant_bookable a significant predictor of price_4_nights?

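instant_bookable is a significant predictor of price_4_nights, as seen from its t-statistic of 4.69 (p < 0.001). The coefficient of 0.092 implies that instantly bookable listings are priced roughly 10% higher, holding the other variables constant.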

4.7.2.2 Use your city knowledge, or ask someone with city knowledge, and see whether you can group neighbourhoods together so the majority of listings falls in fewer (5-6 max) geographical areas.

One member of our study group is from Brussels. He suggested grouping the neighbourhoods into five areas: 'North West', 'North East', 'East/Centre', 'West/Centre' and 'South/Centre'.

new_listings <- new_listings %>%
  mutate(neighbourhood_simplified = case_when(neighbourhood_cleansed %in% c("Jette","Berchem-Sainte-Agathe","Koekelberg", "Molenbeek-Saint-Jean", 
                                       "Ganshoren") ~ "North West", 
                                       neighbourhood_cleansed %in% c( "Saint-Josse-ten-Noode", "Schaerbeek",  "Bruxelles", "Evere") ~ "North East", 
                                      neighbourhood_cleansed %in% c("Woluwe-Saint-Lambert", "Woluwe-Saint-Pierre","Auderghem", "Etterbeek") ~ "East/Centre",
                                      neighbourhood_cleansed %in% c("Saint-Gilles", "Anderlecht", "Forest") ~ " West/Centre",
                                      neighbourhood_cleansed %in% c("Ixelles", "Uccle", "Watermael-Boitsfort") ~ "South/Centre"))

4.8 Model 6

# Fit regression model
model6 <-lm (log_price_4_nights ~ prop_type_simplified + number_of_reviews  + room_type + bathrooms + bedrooms  + 
               accommodates + instant_bookable + neighbourhood_simplified , data = train_data)

msummary(model6)
                                                       Estimate Std. Error
(Intercept)                                           5.1873923  0.0545054
prop_type_simplifiedEntire rental unit                0.0044044  0.0459951
prop_type_simplifiedOther                             0.2381880  0.0519935
prop_type_simplifiedPrivate room in rental unit      -0.0954069  0.0644310
prop_type_simplifiedPrivate room in residential home  0.0063421  0.0722624
number_of_reviews                                    -0.0008850  0.0001181
room_typeHotel room                                   0.1656419  0.0994094
room_typePrivate room                                -0.2975245  0.0411531
room_typeShared room                                 -0.8018106  0.1334929
bathrooms                                             0.0578702  0.0180054
bedrooms                                              0.0377215  0.0122794
accommodates                                          0.1164714  0.0069395
instant_bookable                                      0.0829841  0.0195798
neighbourhood_simplifiedEast/Centre                   0.0077999  0.0373416
neighbourhood_simplifiedNorth East                    0.1231149  0.0254769
neighbourhood_simplifiedNorth West                   -0.0983502  0.0422654
neighbourhood_simplifiedSouth/Centre                  0.0701086  0.0288843
                                                     t value Pr(>|t|)    
(Intercept)                                           95.172  < 2e-16 ***
prop_type_simplifiedEntire rental unit                 0.096  0.92372    
prop_type_simplifiedOther                              4.581 4.87e-06 ***
prop_type_simplifiedPrivate room in rental unit       -1.481  0.13880    
prop_type_simplifiedPrivate room in residential home   0.088  0.93007    
number_of_reviews                                     -7.496 9.25e-14 ***
room_typeHotel room                                    1.666  0.09580 .  
room_typePrivate room                                 -7.230 6.52e-13 ***
room_typeShared room                                  -6.006 2.19e-09 ***
bathrooms                                              3.214  0.00133 ** 
bedrooms                                               3.072  0.00215 ** 
accommodates                                          16.784  < 2e-16 ***
instant_bookable                                       4.238 2.34e-05 ***
neighbourhood_simplifiedEast/Centre                    0.209  0.83456    
neighbourhood_simplifiedNorth East                     4.832 1.44e-06 ***
neighbourhood_simplifiedNorth West                    -2.327  0.02005 *  
neighbourhood_simplifiedSouth/Centre                   2.427  0.01529 *  

Residual standard error: 0.4491 on 2346 degrees of freedom
  (292 observations deleted due to missingness)
Multiple R-squared:  0.4277,    Adjusted R-squared:  0.4238 
F-statistic: 109.6 on 16 and 2346 DF,  p-value: < 2.2e-16

4.8.1 Checking for Collinearity

car::vif(model6)
                             GVIF Df GVIF^(1/(2*Df))
prop_type_simplified     4.278407  4        1.199251
number_of_reviews        1.024343  1        1.012098
room_type                4.398691  3        1.280031
bathrooms                1.574201  1        1.254672
bedrooms                 1.830238  1        1.352863
accommodates             1.898747  1        1.377950
instant_bookable         1.064959  1        1.031968
neighbourhood_simplified 1.059886  4        1.007297
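
The last column of the car::vif() output is the GVIF rescaled by its degrees of freedom, which makes multi-level factors such as prop_type_simplified comparable with single-coefficient terms. As a minimal sketch, the snippet below recomputes that column from the GVIF and Df values printed above. The helper name `adj_gvif` is ours for illustration and is not part of the car package. Squaring the adjusted value and comparing it with the usual VIF cut-off of 5 is a common rule of thumb that we assume here, in line with the threshold used earlier in the report.

```r
# GVIF and Df values copied from the car::vif(model6) output above
gvif <- c(prop_type_simplified = 4.278407, room_type = 4.398691,
          bathrooms = 1.574201, neighbourhood_simplified = 1.059886)
df   <- c(prop_type_simplified = 4, room_type = 3,
          bathrooms = 1, neighbourhood_simplified = 4)

# Hypothetical helper: GVIF^(1/(2*Df)), the df-adjusted GVIF
adj_gvif <- function(gvif, df) gvif^(1 / (2 * df))

adjusted <- adj_gvif(gvif, df)
print(round(adjusted, 6))

# Squaring puts the adjusted value back on the usual VIF scale
print(all(adjusted^2 < 5))
```

All squared adjusted values are far below 5, so collinearity does not appear to be a concern for these terms.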

4.8.2 Checking for Overfitting

# testing overfit
RMSE_model6 <- test_data %>% 
  mutate(predictions = predict(model6, .),  # `.` is the piped-in test_data
         R = predictions - log_price_4_nights) %>%
  select(R) %>% 
  na.omit() %>%  # omit all the NA values in residual
  summarise(RMSE = sqrt(sum(R**2 / n()))) %>% 
  pull()
RMSE_model6
# Output of the original run, which called predict() with model5 instead of model6:
[1] 0.4411484
autoplot(model6)
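
The overfitting check computes the root mean squared error of the test-set residuals, that is sqrt(mean((prediction - actual)^2)). A minimal sketch on made-up numbers follows. The function name `rmse` and the toy vectors are illustrative assumptions, not objects from the analysis. Because the response is log_price_4_nights, an RMSE of r on the log scale corresponds to a typical multiplicative error of about exp(r) on the price itself.

```r
# Hypothetical helper: root mean squared error, ignoring missing values
rmse <- function(actual, predicted) {
  sqrt(mean((predicted - actual)^2, na.rm = TRUE))
}

# Toy log-prices (made up for illustration only)
actual    <- c(5.0, 5.5, 6.0, NA)
predicted <- c(5.2, 5.4, 6.3, 5.8)
toy_rmse  <- rmse(actual, predicted)
print(toy_rmse)

# Back-transform a log-scale RMSE into a multiplicative error factor,
# using the RMSE value reported above
error_factor <- exp(0.4411484)
print(error_factor)
```

On this reading, a log-scale RMSE of about 0.44 means predicted prices are typically off by a factor of roughly 1.55 in either direction.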

4.8.2.1 Key Comments

Location is a significant predictor of price_4_nights, as the t-statistics show. Relative to the baseline area, East/Centre has no significant effect on price (p = 0.83). North East and South/Centre have significant positive effects, while North West has a significant negative effect (estimate -0.098).
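
Because the response is the log of the price, each coefficient is an additive effect on log_price_4_nights, and exponentiating it gives the approximate multiplicative effect on the price itself. The sketch below applies 100 * (exp(b) - 1) to the neighbourhood estimates from the model 6 table. The vector name `nb_coef` is ours for illustration.

```r
# Neighbourhood estimates copied from the model6 regression table
nb_coef <- c("East/Centre"  =  0.0077999,
             "North East"   =  0.1231149,
             "North West"   = -0.0983502,
             "South/Centre" =  0.0701086)

# Percentage change in price relative to the baseline neighbourhood group,
# holding the other predictors fixed
pct_change <- 100 * (exp(nb_coef) - 1)
print(round(pct_change, 1))
```

Under this reading, a North East listing is priced roughly 13% above the baseline area and a North West listing roughly 9% below it, other things equal.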

  1. What is the effect of availability_30 or reviews_per_month on price_4_nights, after we control for other variables?

4.9 Model 7

# Fit regression model
model7 <-lm (log_price_4_nights ~ prop_type_simplified +  number_of_reviews + room_type + bathrooms + bedrooms +  
               accommodates +  instant_bookable + neighbourhood_simplified + reviews_per_month + availability_30 , data = train_data)

# Get regression table:
msummary(model7)
                                                       Estimate Std. Error
(Intercept)                                           5.151e+00  5.247e-02
prop_type_simplifiedEntire rental unit               -1.150e-02  4.381e-02
prop_type_simplifiedOther                             1.595e-01  4.944e-02
prop_type_simplifiedPrivate room in rental unit      -7.750e-02  6.209e-02
prop_type_simplifiedPrivate room in residential home  4.815e-02  6.903e-02
number_of_reviews                                     6.408e-05  1.343e-04
room_typeHotel room                                   9.306e-02  9.790e-02
room_typePrivate room                                -4.225e-01  4.001e-02
room_typeShared room                                 -8.448e-01  1.178e-01
bathrooms                                             3.724e-02  1.640e-02
bedrooms                                              3.340e-02  1.102e-02
accommodates                                          1.200e-01  6.486e-03
instant_bookable                                      6.457e-02  1.877e-02
neighbourhood_simplifiedEast/Centre                   1.076e-03  3.540e-02
neighbourhood_simplifiedNorth East                    9.706e-02  2.435e-02
neighbourhood_simplifiedNorth West                   -1.475e-01  4.062e-02
neighbourhood_simplifiedSouth/Centre                  5.262e-02  2.728e-02
reviews_per_month                                    -5.345e-02  6.338e-03
availability_30                                       1.628e-02  8.694e-04
                                                     t value Pr(>|t|)    
(Intercept)                                           98.178  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -0.263 0.792948    
prop_type_simplifiedOther                              3.226 0.001274 ** 
prop_type_simplifiedPrivate room in rental unit       -1.248 0.212070    
prop_type_simplifiedPrivate room in residential home   0.698 0.485547    
number_of_reviews                                      0.477 0.633356    
room_typeHotel room                                    0.951 0.341927    
room_typePrivate room                                -10.560  < 2e-16 ***
room_typeShared room                                  -7.173 1.03e-12 ***
bathrooms                                              2.270 0.023315 *  
bedrooms                                               3.031 0.002472 ** 
accommodates                                          18.497  < 2e-16 ***
instant_bookable                                       3.441 0.000592 ***
neighbourhood_simplifiedEast/Centre                    0.030 0.975751    
neighbourhood_simplifiedNorth East                     3.986 6.97e-05 ***
neighbourhood_simplifiedNorth West                    -3.631 0.000290 ***
neighbourhood_simplifiedSouth/Centre                   1.929 0.053932 .  
reviews_per_month                                     -8.433  < 2e-16 ***
availability_30                                       18.729  < 2e-16 ***

Residual standard error: 0.393 on 2006 degrees of freedom
  (630 observations deleted due to missingness)
Multiple R-squared:  0.5569,    Adjusted R-squared:  0.553 
F-statistic: 140.1 on 18 and 2006 DF,  p-value: < 2.2e-16

4.9.0.1 Key Comments

In this model, number_of_reviews is not significant (p = 0.63). This may be because reviews_per_month carries much of the same information as number_of_reviews, which makes the latter redundant. We therefore replace number_of_reviews with review_scores_rating, which is significant (p = 0.006), and refit the model.

# Fit regression model
model7 <-lm (log_price_4_nights ~ prop_type_simplified + review_scores_rating + room_type + bathrooms + bedrooms +  
               accommodates +  instant_bookable + neighbourhood_simplified + reviews_per_month + availability_30 , data = train_data)

# Get regression table:
msummary(model7)
                                                       Estimate Std. Error
(Intercept)                                           4.9696786  0.0839576
prop_type_simplifiedEntire rental unit               -0.0084988  0.0436885
prop_type_simplifiedOther                             0.1635574  0.0491512
prop_type_simplifiedPrivate room in rental unit      -0.0685048  0.0616751
prop_type_simplifiedPrivate room in residential home  0.0507285  0.0685730
review_scores_rating                                  0.0383368  0.0139127
room_typeHotel room                                   0.0863199  0.0977376
room_typePrivate room                                -0.4252994  0.0398456
room_typeShared room                                 -0.8333446  0.1175003
bathrooms                                             0.0366305  0.0163751
bedrooms                                              0.0340481  0.0110024
accommodates                                          0.1202994  0.0064751
instant_bookable                                      0.0677469  0.0187699
neighbourhood_simplifiedEast/Centre                   0.0003808  0.0353283
neighbourhood_simplifiedNorth East                    0.0969861  0.0243075
neighbourhood_simplifiedNorth West                   -0.1481909  0.0405403
neighbourhood_simplifiedSouth/Centre                  0.0494046  0.0272550
reviews_per_month                                    -0.0533501  0.0050452
availability_30                                       0.0166477  0.0008780
                                                     t value Pr(>|t|)    
(Intercept)                                           59.193  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -0.195 0.845778    
prop_type_simplifiedOther                              3.328 0.000892 ***
prop_type_simplifiedPrivate room in rental unit       -1.111 0.266815    
prop_type_simplifiedPrivate room in residential home   0.740 0.459524    
review_scores_rating                                   2.756 0.005913 ** 
room_typeHotel room                                    0.883 0.377245    
room_typePrivate room                                -10.674  < 2e-16 ***
room_typeShared room                                  -7.092 1.82e-12 ***
bathrooms                                              2.237 0.025399 *  
bedrooms                                               3.095 0.001998 ** 
accommodates                                          18.579  < 2e-16 ***
instant_bookable                                       3.609 0.000314 ***
neighbourhood_simplifiedEast/Centre                    0.011 0.991402    
neighbourhood_simplifiedNorth East                     3.990 6.85e-05 ***
neighbourhood_simplifiedNorth West                    -3.655 0.000263 ***
neighbourhood_simplifiedSouth/Centre                   1.813 0.070031 .  
reviews_per_month                                    -10.574  < 2e-16 ***
availability_30                                       18.960  < 2e-16 ***

Residual standard error: 0.3923 on 2006 degrees of freedom
  (630 observations deleted due to missingness)
Multiple R-squared:  0.5586,    Adjusted R-squared:  0.5546 
F-statistic:   141 on 18 and 2006 DF,  p-value: < 2.2e-16

4.9.1 Checking for Collinearity

car::vif(model7)
                             GVIF Df GVIF^(1/(2*Df))
prop_type_simplified     4.437179  4        1.204726
review_scores_rating     1.057403  1        1.028301
room_type                4.548614  3        1.287201
bathrooms                1.564861  1        1.250944
bedrooms                 1.786566  1        1.336625
accommodates             1.878203  1        1.370475
instant_bookable         1.092379  1        1.045169
neighbourhood_simplified 1.090815  4        1.010925
reviews_per_month        1.093100  1        1.045514
availability_30          1.106151  1        1.051737

4.9.2 Checking for Overfitting

# testing overfit
RMSE_model7 <- test_data %>% 
  mutate(predictions = predict(model7, .),
         R = predictions - log_price_4_nights) %>%  # `.` above refers to the piped-in test_data
  select(R) %>% 
  na.omit() %>%  # omit all the NA values in residual
  summarise(RMSE = sqrt(sum(R**2 / n()))) %>% 
  pull()
RMSE_model7
[1] 0.3703
autoplot(model7)

4.9.2.1 Key Comment

availability_30 has a significant positive effect on price_4_nights, whereas reviews_per_month has a significant negative effect (estimate -0.053).

4.10 Creating summary tables

library(huxtable)  # provides huxreg() and set_caption()
huxreg(list('model1' = model1,
            'model2' = model2, 
            'model3' = model3, 
            'model4' = model4, 
            'model5' = model5, 
            'model6' = model6, 
            'model7' = model7),
       
       statistics = c('#observations' = 'nobs', 
                      'R squared' = 'r.squared', 
                      'Adj. R Squared' = 'adj.r.squared', 
                      'Residual SE' = 'sigma'), 
                 bold_signif = 0.05, 
                 stars = NULL
) %>% 
  set_caption('Comparison of models')
Comparison of models
                                                        model1   model2   model3   model4   model5   model6   model7
(Intercept)                                              5.883    5.793    5.290    5.331    5.243    5.187    4.970
                                                       (0.096)  (0.045)  (0.051)  (0.089)  (0.052)  (0.055)  (0.084)
prop_type_simplifiedEntire rental unit                  -0.017   -0.018   -0.007    0.005    0.009    0.004   -0.008
                                                       (0.052)  (0.046)  (0.046)  (0.049)  (0.046)  (0.046)  (0.044)
prop_type_simplifiedOther                                0.161    0.480    0.237    0.234    0.242    0.238    0.164
                                                       (0.055)  (0.053)  (0.052)  (0.055)  (0.052)  (0.052)  (0.049)
prop_type_simplifiedPrivate room in rental unit         -0.563    0.115   -0.108   -0.034   -0.083   -0.095   -0.069
                                                       (0.060)  (0.067)  (0.065)  (0.069)  (0.065)  (0.064)  (0.062)
prop_type_simplifiedPrivate room in residential home    -0.430    0.209   -0.028    0.086    0.001    0.006    0.051
                                                       (0.072)  (0.076)  (0.073)  (0.076)  (0.073)  (0.072)  (0.069)
number_of_reviews                                       -0.001   -0.001   -0.001   -0.001   -0.001   -0.001
                                                       (0.000)  (0.000)  (0.000)  (0.000)  (0.000)  (0.000)
review_scores_rating                                    -0.024                     -0.016                      0.038
                                                       (0.017)                    (0.015)                    (0.014)
room_typeHotel room                                              -0.111    0.233    0.290    0.181    0.166    0.086
                                                                (0.097)  (0.099)  (0.108)  (0.100)  (0.099)  (0.098)
room_typePrivate room                                            -0.654   -0.303   -0.381   -0.312   -0.298   -0.425
                                                                (0.042)  (0.041)  (0.044)  (0.041)  (0.041)  (0.040)
room_typeShared room                                             -1.219   -0.813   -0.786   -0.811   -0.802   -0.833
                                                                (0.149)  (0.134)  (0.131)  (0.135)  (0.133)  (0.118)
bathrooms                                                                  0.055    0.039    0.059    0.058    0.037
                                                                         (0.018)  (0.018)  (0.018)  (0.018)  (0.016)
bedrooms                                                                   0.036    0.029    0.036    0.038    0.034
                                                                         (0.012)  (0.012)  (0.012)  (0.012)  (0.011)
beds                                                                      -0.006
                                                                         (0.012)
accommodates                                                               0.120    0.128    0.118    0.116    0.120
                                                                         (0.009)  (0.007)  (0.007)  (0.007)  (0.006)
host_is_superhost                                                                  -0.005
                                                                                  (0.024)
instant_bookable                                                                             0.092    0.083    0.068
                                                                                           (0.020)  (0.020)  (0.019)
neighbourhood_simplifiedEast/Centre                                                                   0.008    0.000
                                                                                                    (0.037)  (0.035)
neighbourhood_simplifiedNorth East                                                                    0.123    0.097
                                                                                                    (0.025)  (0.024)
neighbourhood_simplifiedNorth West                                                                   -0.098   -0.148
                                                                                                    (0.042)  (0.041)
neighbourhood_simplifiedSouth/Centre                                                                  0.070    0.049
                                                                                                    (0.029)  (0.027)
reviews_per_month                                                                                             -0.053
                                                                                                             (0.005)
availability_30                                                                                                0.017
                                                                                                             (0.001)
#observations                                             2286     2655     2346     2024     2363     2363     2025
R squared                                                0.156    0.239    0.416    0.447    0.416    0.428    0.559
Adj. R Squared                                           0.154    0.237    0.413    0.444    0.413    0.424    0.555
Residual SE                                              0.530    0.508    0.451    0.439    0.453    0.449    0.392

4.10.0.1 Note

RMSE in the testing dataset

tibble(RMSE_model1, RMSE_model2, RMSE_model3, RMSE_model4, RMSE_model5,
       RMSE_model6, RMSE_model7)  # tibble() replaces the deprecated data_frame()
RMSE_model1 RMSE_model2 RMSE_model3 RMSE_model4 RMSE_model5 RMSE_model6 RMSE_model7
      0.518       0.504       0.441       0.418       0.441       0.441        0.37

4.10.0.2 Note

Model 7 has the highest adjusted R^2 (0.555) and the lowest RMSE on the test set (0.370). Its test RMSE is also no higher than its residual standard error on the training set (0.392), so there is no sign of overfitting. We therefore use model7 for prediction.
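
The selection logic can be sketched by collecting the reported figures in one table and comparing them. The snippet below uses the adjusted R^2, residual SE, and test RMSE values reported above. The object names are ours for illustration. Comparing the training residual SE with the test RMSE is only a rough overfitting check, since the two are computed on different rows and the residual SE includes a degrees-of-freedom correction.

```r
# Figures copied from the comparison table and the test-set RMSE table above
fit_summary <- data.frame(
  model       = paste0("model", 1:7),
  adj_r2      = c(0.154, 0.237, 0.413, 0.444, 0.413, 0.424, 0.555),
  residual_se = c(0.530, 0.508, 0.451, 0.439, 0.453, 0.449, 0.392),
  test_rmse   = c(0.518, 0.504, 0.441, 0.418, 0.441, 0.441, 0.370)
)

# A test RMSE well above the training residual SE would suggest overfitting
fit_summary$gap <- fit_summary$test_rmse - fit_summary$residual_se

best_adj_r2 <- fit_summary$model[which.max(fit_summary$adj_r2)]
best_rmse   <- fit_summary$model[which.min(fit_summary$test_rmse)]
print(fit_summary)
print(c(best_adj_r2 = best_adj_r2, best_rmse = best_rmse))
```

Both criteria point to model7, and none of the gaps is positive.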

4.11 Prediction using our model of choice

# Characteristics of the hypothetical listing we want to price
prop_type_simplified <- 'Private room in rental unit'
number_of_reviews <- 10  # not a predictor in the final model7; predict() ignores it
room_type <- 'Private room'
bathrooms <- 1
bedrooms <- 1
accommodates <- 1
instant_bookable <- 1
review_scores_rating <- 4.5
neighbourhood_simplified <- 'North West'
reviews_per_month <- 1.6 # take the mean value
availability_30 <- 15

data_project <- data.frame(prop_type_simplified, number_of_reviews, review_scores_rating, room_type,
                           bathrooms, bedrooms, accommodates, instant_bookable,
                           neighbourhood_simplified, reviews_per_month, availability_30)


data.frame(predict(model7, newdata = data_project, interval = 'prediction')) %>% 
  mutate(fit = exp(fit),
         lwr = exp(lwr),
         upr = exp(upr))
fit   lwr    upr
137   63.4   298
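
As a sanity check on the point prediction, the fitted value can be rebuilt by hand: sum the relevant model 7 coefficients multiplied by the chosen inputs to get the predicted log price, then exponentiate. The sketch below uses the estimates from the final model 7 table (the version with review_scores_rating). The object names are ours for illustration, and the result is only approximate because the printed coefficients are rounded.

```r
# Estimates copied from the final model7 table (review_scores_rating version)
contrib <- c(
  intercept            = 4.9696786,
  private_room_rental  = -0.0685048,       # prop_type_simplified level
  review_scores_rating = 0.0383368 * 4.5,
  room_type_private    = -0.4252994,
  bathrooms            = 0.0366305 * 1,
  bedrooms             = 0.0340481 * 1,
  accommodates         = 0.1202994 * 1,
  instant_bookable     = 0.0677469 * 1,
  north_west           = -0.1481909,
  reviews_per_month    = -0.0533501 * 1.6,
  availability_30      = 0.0166477 * 15
)

log_price <- sum(contrib)    # predicted log_price_4_nights
price     <- exp(log_price)  # back-transformed to the price scale
print(c(log_price = log_price, price = price))
```

The hand calculation rounds to about 137, in line with the fit value returned by predict().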

4.11.1 Our Answer to the prediction

Suppose I want to book a private room in a rental unit with one bathroom and one bedroom that accommodates one person. The room should be instantly bookable and located in the North West. I assume 1.6 reviews per month (the mean value), an average rating of 4.5, and about 15 available days in the next 30.